Static Partition

Usually when loading files (big files) into Hive tables static partitions are preferred. That saves your time in loading data compared to dynamic partition. You "statically" add a partition in table and move the file into the partition of the table. Since the files are big they are usually generated in HDFS. You can get the partition column value form the filename, day of date etc without reading the whole big file.In static partitioning, we need to specify the partition column value in each and every LOAD statement.

Hive Static Partitioning
  • Insert input data files individually into a partition table is Static Partition.
  • Usually when loading files (big files) into Hive tables static partitions are preferred.
  • Static Partition saves your time in loading data compared to dynamic partition.
  • You “statically” add a partition in the table and move the file into the partition of the table.
  • We can alter the partition in the static partition.
  • You can get the partition column value from the filename, day of date etc without reading the whole big file.
  • If you want to use the Static partition in the hive you should set property set hive.mapred.mode = strict This property set by default in hive-site.xml
  • Static partition is in Strict Mode.
  • You should use where clause to use limit in the static partition.
  • You can perform Static partition on Hive Manage table or external table.
Partitioning data is often used for distributing load horizontally, this has performance benefit,and helps in organizing data in a logical fashion. Example: if we are dealing with a large employee table and often run queries with WHERE clauses that restrict the results to a particular country or department. For a faster query response Hive table can be PARTITIONED BY (country string,dept string).Partitioning tables changes how Hive structures the data storage and Hive will now create subdirectories reflecting the partitioning structure like
.../employees/country=ABC/DEPT=XYZ.

If query limits for employee from country=ABC, it will only scan the contents of one directory country=ABC. This can dramatically improve query performance, but only if the partitioning scheme reflects common filtering. Partitioning feature is very useful in Hive, however, a design that creates too many partitions may optimize some queries, but be detrimental for other important queries. Other drawback is having too many partitions is the large number of Hadoop files and directories that are created unnecessarily and overhead to NameNode since it must keep all metadata for the file system in memory.

Create Partitioned Table 
create table user
(
firstname VARCHAR(64),
lastname  VARCHAR(64),
address   STRING,
city       VARCHAR(64),
post      STRING,
phone1    VARCHAR(64),
phone2    STRING,
email     STRING,
web       STRING
)
PARTITIONED BY (country VARCHAR(64), state VARCHAR(64))
row format delimited fields terminated by '\t'
LINES TERMINATED BY '\n' STORED AS TEXTFILE

Data File Data

To she see the table Partition
Show partitions Table Name

Load Data into Table
LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/data/Hive_Data/sampl_data1.txt' INTO TABLE USER_SP  PARTITION (country = 'US', state = 'CA');

INSERT OVERWRITE TABLE  USER_SP PARTITION (country = 'US', state = 'NY') SELECT Firstname,Lastname,Address,City,Postal Code,Phone1,Phone2,Email,Web FROM USER  WHERE  country = 'US' AND state = 'NY';

Internally, the data for each partition will be stored as separate files under separate sub-directories. We can see the physical storage by going to the HDFS location of the table.

hadoop fs -ls /apps/hive/warehouse/db_name.db/USER/country=US/state=NY/

How to get data from specific partition in Hive table?
show partitions TABLENAME
pt=2012.07.28.08/is_complete=1
pt=2012.07.28.09/is_complete=1
pt=2012.07.28.10/is_complete=1
pt=2012.07.28.11/is_complete=1
select * from TABLENAME where pt='2012.07.28.10/is_complete=1' limit 1;

Note: In Static partition some risky queries are not allowed to run. They include:
  • Cartesian Product.
  • No partition being picked up for a query.
  • Comparing bigints and strings.
  • Comparing bigints and doubles.
  • Orderby without limit.

No comments:

Post a Comment